
feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism#501

Merged
Lightheartdevs merged 7 commits into Light-Heart-Labs:main from y-coffee-dev:feat/multi-gpu
Mar 25, 2026

Conversation

@y-coffee-dev
Contributor

feat: NVIDIA Multi-GPU Detection, Topology-Aware Assignment & Parallelism

Summary

Adds end-to-end multi-GPU support for NVIDIA systems. The installer now automatically detects multi-GPU topology, assigns GPUs to services based on interconnect quality and VRAM capacity, and configures services for multi-GPU usage, all without manual intervention. A custom assignment TUI is also available for advanced users.

Architecture

Topology Detection (nvidia-topo.sh)

Parses the nvidia-smi topo -m matrix to extract GPU-to-GPU link types and assigns each link a numerical rank (faster interconnects rank higher).
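For illustration, a minimal Python sketch of this kind of matrix parsing. The link labels, rank values, and function name here are assumptions for the example; the actual mapping lives in nvidia-topo.sh.

```python
# Illustrative link-type ranks; the real values are defined in nvidia-topo.sh.
LINK_RANKS = {"NV12": 92, "NV2": 82, "NV1": 80, "PIX": 60, "PXB": 50,
              "PHB": 40, "NODE": 20, "SYS": 5}

def parse_topo_matrix(text):
    """Parse `nvidia-smi topo -m` output into {(gpu_a, gpu_b): rank}."""
    lines = [l for l in text.strip().splitlines() if l.strip()]
    header = lines[0].split()
    # Keep only GPU columns; NIC columns (e.g. NIC0, mlx5_0) are ignored.
    gpu_cols = [i for i, c in enumerate(header) if c.startswith("GPU")]
    ranks = {}
    for line in lines[1:]:
        cells = line.split()
        if not cells[0].startswith("GPU"):
            continue  # skip NIC rows and the trailing legend
        row = int(cells[0][3:])
        for i in gpu_cols:
            if i + 1 >= len(cells):
                continue  # row shorter than header (affinity columns etc.)
            label = cells[i + 1]  # +1 because the first cell is the row name
            col = int(header[i][3:])
            if row != col:  # the diagonal is the "X" self entry
                ranks[(row, col)] = LINK_RANKS.get(label, 0)
    return ranks
```

Given a two-GPU NVLink matrix, this yields a symmetric rank map such as `{(0, 1): 82, (1, 0): 82}`, which downstream phases can reduce to a minimum link rank.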

GPU Assignment Algorithm (assign_gpus.py)

Four-phase pipeline:

  1. Topology Analysis — Parse GPUs and links, build rank matrix
  2. Subset Enumeration — Generate all GPU subsets, sorted by min link rank (desc), size (asc), VRAM (desc). Find the best subset that fits the model; if none fits, greedily span across GPUs
  3. Service Assignment — Allocate remaining GPUs to whisper/comfyui/embeddings based on availability:
    • 0 remaining: colocate all services on llama's last GPU
    • 1 remaining: all auxiliary services share that GPU
    • 2 remaining: whisper gets one, comfyui+embeddings share the other
    • 3+ remaining: dedicated GPUs; extras go back to llama
  4. Parallelism Selection — Based on GPU count and min link rank:
    • NVLink/XGMI (rank >= 80): tensor parallel (<=3 GPUs) or hybrid (>3 GPUs)
    • Same-NUMA PCIe (rank 11-79): pipeline (<=3 GPUs) or hybrid if rank >= 40
    • Cross-NUMA (rank <= 10): pipeline only
    • Heterogeneous VRAM: proportional tensor split weights
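The phase-4 thresholds above can be sketched as follows. This is a minimal restatement for illustration; the function name and return shape are assumptions, not the actual assign_gpus.py interface.

```python
def choose_parallelism(gpu_count, min_link_rank, vram_mb_per_gpu):
    """Sketch of the phase-4 decision table (illustrative, not the real API)."""
    if gpu_count < 2:
        return {"mode": "none"}
    result = {}
    if min_link_rank >= 80:       # NVLink/XGMI: fast enough for tensor parallel
        result["mode"] = "tensor" if gpu_count <= 3 else "hybrid"
    elif min_link_rank >= 11:     # same-NUMA PCIe
        if min_link_rank >= 40 and gpu_count > 3:
            result["mode"] = "hybrid"
        else:
            result["mode"] = "pipeline"
    else:                         # cross-NUMA: pipeline only
        result["mode"] = "pipeline"
    # Heterogeneous VRAM: weight the tensor split proportionally to capacity.
    if len(set(vram_mb_per_gpu)) > 1:
        total = sum(vram_mb_per_gpu)
        result["tensor_split"] = [round(v / total, 3) for v in vram_mb_per_gpu]
    return result
```

For example, two NVLinked GPUs select tensor parallelism, while four GPUs split across NUMA nodes fall back to pipeline mode.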

Compose Layering

When GPU_COUNT > 1, the stack adds:

  • docker-compose.multigpu.yml — llama-server GPU pinning + split mode
  • extensions/services/*/compose.multigpu.yaml — per-service GPU pinning
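Conceptually, the overlay selection amounts to something like this hypothetical sketch. The real resolve-compose-stack.sh discovers overlays on disk rather than assuming them, so a faithful implementation would also check that each file exists.

```python
def compose_files(gpu_count, services, root="."):
    """Assemble the ordered list of compose files for the stack (sketch)."""
    files = [f"{root}/docker-compose.yml"]
    if gpu_count > 1:
        # Multi-GPU overlay for llama-server pinning + split mode.
        files.append(f"{root}/docker-compose.multigpu.yml")
        # Per-service pinning overlays from extensions.
        for svc in services:
            files.append(f"{root}/extensions/services/{svc}/compose.multigpu.yaml")
    return files
```

With `gpu_count=1` only the base file is used, which is how the single-GPU path stays untouched.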

Interactive TUI

Multi-GPU systems get a configuration prompt:

  • [1] Automatic — runs assign_gpus.py with detected topology
  • [2] Custom — manual GPU-to-service assignment

Non-interactive installs default to automatic assignment.

Test coverage

Automated tests

  • tests/test-nvidia-topo.sh — Tests topology matrix parsing against 7 fixture files covering 1-GPU through 8-GPU configurations, NVLink/PCIe/NUMA topologies, and edge cases like NIC rows in the matrix
  • tests/test-assign-gpus.py — Comprehensive pytest suite covering:
    • Single GPU: strategy, service sharing, parallelism mode, model-too-large error
    • 2-GPU PHB: colocated strategy, pipeline parallelism
    • 4-GPU SOC (cross-NUMA): pipeline mode, dedicated strategy
    • 4-GPU SYS + NV pairs: mixed topology handling
    • 5-GPU NV12 + MLX5: NVLink with NIC filtering
    • 8-GPU NV12 full mesh: tensor/hybrid parallelism selection
    • 8-GPU NV1/NV2 partial mesh: degraded NVLink handling
    • VRAM overflow / span subset scenarios
    • Heterogeneous GPU tensor split proportions

Manual hardware testing

Thoroughly tested on several multi-GPU machines with various configurations including (non-exhaustive):

  • 2x NVIDIA RTX 3060
  • 4x NVIDIA RTX 4080
  • 4x NVIDIA RTX 5060 Ti

All tests confirmed correct topology detection, appropriate strategy selection, and proper compose overlay layering.

What changed

New files

  • installers/lib/nvidia-topo.sh — NVIDIA topology detection library: parses the nvidia-smi topo -m matrix into structured JSON with link types, ranks, and labels
  • scripts/assign_gpus.py — GPU assignment algorithm: 4-phase pipeline of topology analysis, subset enumeration, service assignment, and parallelism selection
  • docker-compose.multigpu.yml — Compose overlay for llama-server with NVIDIA_VISIBLE_DEVICES, LLAMA_ARG_SPLIT_MODE, and LLAMA_ARG_TENSOR_SPLIT
  • extensions/services/comfyui/compose.multigpu.yaml — Per-service GPU pinning overlay for ComfyUI
  • extensions/services/whisper/compose.multigpu.yaml — Per-service GPU pinning overlay for Whisper
  • extensions/services/embeddings/compose.multigpu.yaml — Per-service GPU pinning overlay for Embeddings
  • tests/test-nvidia-topo.sh — Shell tests for topology parsing against fixture matrices
  • tests/test-assign-gpus.py — Python tests covering single GPU, 2-GPU colocated, 4-GPU NVLink/SYS, 5-GPU NVLink, and 8-GPU full/partial mesh topologies
  • tests/fixtures/topology_json/*.json (8 files) — JSON topology fixtures: 1-GPU PCIe, 2-GPU PHB, 4-GPU SOC, 4-GPU SYS+NV pairs, 5-GPU NV12+MLX5, 8-GPU NV12 full mesh, 8-GPU NV12+NUMA, 8-GPU NV1/NV2 partial mesh
  • tests/fixtures/topology_matrix/*.txt (7 files) — Raw nvidia-smi topo -m output fixtures for shell-level testing

Modified files

  • installers/phases/01-preflight.sh — Adds jq and python3 to preflight dependency checks (required by topology detection and assignment)
  • installers/phases/02-detection.sh — Integrates detect_nvidia_topo(): populates GPU_TOPOLOGY_JSON, GPU_HAS_NVLINK, GPU_TOTAL_VRAM, LLM_MODEL_SIZE_MB
  • installers/phases/03-features.sh — Major expansion: multi-GPU configuration TUI with automatic and custom assignment modes, parallelism selection, env var extraction
  • installers/phases/04-requirements.sh — Adds the multi-GPU compose overlay to requirements
  • installers/phases/06-directories.sh — Persists GPU_ASSIGNMENT_JSON and per-service GPU UUIDs to .env
  • installers/lib/constants.sh — Adds multi-GPU related constants
  • installers/lib/tier-map.sh — Adds multi-GPU tier mappings
  • installers/lib/compose-select.sh — Includes docker-compose.multigpu.yml when GPU_COUNT > 1
  • scripts/resolve-compose-stack.sh — Accepts a --gpu-count flag; discovers and merges compose.multigpu.yaml from extensions
  • scripts/detect-hardware.sh — Sources nvidia-topo.sh for topology detection
  • scripts/build-capability-profile.sh — Reads the actual gpu.count from the capability profile instead of hardcoding 1
  • .env.schema.json — Adds new env vars: GPU_ASSIGNMENT_JSON_B64, LLAMA_SERVER_GPU_UUIDS, LLAMA_ARG_SPLIT_MODE, LLAMA_ARG_TENSOR_SPLIT, EMBEDDINGS_GPU_UUID, COMFYUI_GPU_UUID, WHISPER_GPU_UUID, N_GPU_LAYERS

Collaborator

@Lightheartdevs Lightheartdevs left a comment


Review: Needs Work

Strong algorithm and good test coverage (561 lines of pytest), but a few issues need resolving before merge:

1. jq promoted from optional to required (breaking)

01-preflight.sh now hard-requires jq. This will fail installs on minimal systems (e.g., fresh Debian/Alpine containers) that previously worked fine. Either:

  • Auto-install jq (like Docker is auto-installed in phase 05), or
  • Keep it optional with graceful degradation when absent

2. No CI checks have run

This branch has zero CI results. Please push a commit or re-trigger CI so we can see if it passes the test matrix.

3. Docker Compose GPU reservation conflict

docker-compose.multigpu.yml sets both NVIDIA_VISIBLE_DEVICES env var AND deploy.resources.reservations.devices without device_ids. The reservation block will reserve ALL GPUs while the env var tries to limit visibility. These two mechanisms conflict — pick one or wire device_ids dynamically.

4. Minor: duplicate comment line

constants.sh has INSTALL_START_EPOCH listed twice in the "Provides" header comment.

What's good

  • The topology detection with nvidia-smi topo -m fallback is well-handled
  • assign_gpus.py algorithm is correct and the O(2^N) subset enumeration is fine for realistic GPU counts
  • Single-GPU path is preserved (gated on GPU_COUNT > 1)
  • Graceful degradation when nvidia-smi is absent

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

@y-coffee-dev
Contributor Author

@Lightheartdevs Thanks for the thorough review! I adjusted the PR.

1. jq auto-install - Good catch. I've added auto-install logic for jq.
2. CI - Pushed an adjustments commit, this should trigger the CI pipeline
3. Docker Compose GPU reservation - This is intentional rather than a conflict. device_ids in the deploy.resources.reservations.devices block can't be set dynamically: Docker Compose variable interpolation only produces scalar strings, and device_ids expects a YAML sequence, so there's no way to inject a list like ['0', '2'] from an environment variable.
The two mechanisms layer: the deploy reservation makes all GPUs available at the Docker level, and the NVIDIA container runtime then uses NVIDIA_VISIBLE_DEVICES to scope which GPUs are actually visible inside the container.
This is a common approach when you need dynamic per-container GPU assignment in Compose.

4. INSTALL_START_EPOCH duplication - Fixed!

I appreciate the detailed feedback!

@Lightheartdevs
Collaborator

Review Update — Rebase Required Before Merge

Hey @y-coffee-dev, great work addressing the previous review items. The code itself is solid and we want to get this merged. However, we found a critical issue that needs attention first.

🚨 Silent merge bug: LLM_MODEL_SIZE_MB will be dropped

Since you branched, we merged #572/#573/#574 which rewrote the model names and URLs in tier-map.sh (Qwen 3 → Qwen 3.5). Your branch adds LLM_MODEL_SIZE_MB to each tier in that same file.

Git reports a clean merge — no conflicts — but the result silently drops all 11 of your LLM_MODEL_SIZE_MB additions. This happens because git sees main's rewrites and your additions as non-overlapping changes within each tier block, and resolves by taking main's version (which has no LLM_MODEL_SIZE_MB).

What breaks: assign_gpus.py gets called with --model-size "" → float("") raises ValueError → multi-GPU assignment fails on every install. Single-GPU installs are fine (early return guard), but the entire multi-GPU feature would be DOA.
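For context, a defensive guard like the following would fail loudly with an actionable message instead of a bare ValueError. This is a sketch of one possible mitigation, not current assign_gpus.py code.

```python
def parse_model_size(raw):
    """Validate the --model-size argument before use (illustrative guard)."""
    if raw is None or raw.strip() == "":
        # float("") raises a bare ValueError; exit with a diagnosable message.
        raise SystemExit(
            "LLM_MODEL_SIZE_MB is empty: tier-map.sh did not populate it, "
            "so multi-GPU assignment cannot size the model"
        )
    return float(raw)
```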

What's needed

  1. Rebase onto current main (commit 5a932e9)
  2. Re-add LLM_MODEL_SIZE_MB to each tier. The new Qwen 3.5 model sizes (update as needed):
CLOUD:      LLM_MODEL_SIZE_MB=0
ARC:        LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
ARC_LITE:   LLM_MODEL_SIZE_MB=2870    # Qwen3.5-4B-Q4_K_M
NV_ULTRA:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_LARGE:   LLM_MODEL_SIZE_MB=48500   # Qwen3-Coder-Next-Q4_K_M (unchanged)
SH_COMPACT: LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)
Tier 0:     LLM_MODEL_SIZE_MB=1500    # Qwen3.5-2B-Q4_K_M
Tier 1:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 2:     LLM_MODEL_SIZE_MB=5760    # Qwen3.5-9B-Q4_K_M
Tier 3:     LLM_MODEL_SIZE_MB=16400   # Qwen3.5-27B-Q4_K_M
Tier 4:     LLM_MODEL_SIZE_MB=18600   # Qwen3-30B-A3B-Q4_K_M (unchanged)

⚠️ Double-check these against the actual GGUF file sizes on HuggingFace — the Qwen 3.5 models are new and some sizes may differ from the Qwen 3 equivalents you had before.

  3. Push — this should also trigger CI, which hasn't run yet on this branch.

Everything else looks good

We did a full merge simulation and traced every touched installer file. The single-GPU path is completely safe — your guards in 02-detection.sh (GPU_COUNT -gt 1) and 03-features.sh (GPU_COUNT -le 1 → return) are clean. The compose layering, hardware detection additions, and .env generation all use safe defaults. No behavioral changes for existing single-GPU installs on any backend.

Two minor suggestions for a follow-up (non-blocking):

  • Add trap "rm -f $TOPOLOGY_FILE" EXIT after the mktemp in 03-features.sh to clean up on early exit
  • Add a # NOTE: keep in sync with assign_gpus.py comment in the custom TUI parallelism logic in 03-features.sh, since it duplicates the threshold logic from the Python script

Looking forward to the rebase — this is a great feature and we want to ship it. 🚀

Co-Authored-By: Claude Opus 4.6 (1M context) noreply@anthropic.com

- Enhanced multi-GPU tier assignment based on topology
- Implemented robust GPU topology detection for NVIDIA
- Implemented GPU link ranking from the fastest to the slowest, for optimal strategy selection in the future phases
- Implemented gathering detailed per-GPU information
- Data structures for GPU information storage
- Robust and comprehensive test suite for NVIDIA topology detection
- Multi-GPU strategy selection algorithm
- Careful handling of edge cases and subtle bugs in strategy selection
- Robust test suite for multi-GPU strategy selection algorithm

GPU assignment and parallelization strategy selection algo, clustering GPUs by topology links to find the optimal setup, Multi-GPU configuration TUI, docker compose overlays for multi-gpu setups

Adjust env schema validation

Fixed inconsistencies in gpu count, json escaping issues, etc

fix issue with writing multigpu overlay

fix resolve-compose-stack.sh multi gpu overlay

fix gpu device id

Refactors + less convoluted docker compose setup

N_GPU_LAYERS validation

fix multi-gpu overlay

@Lightheartdevs
Collaborator

Code Review — 2026-03-24

Overall: Well-architected PR. The topology detection, strategy selection, and compose overlay pattern are all clean and consistent with existing conventions. Good test coverage (7 fixtures). Not merging yet due to conflicts and testing requirements, but this is on track.

Merge Conflicts (5 files)

This PR was branched before the Lemonade integration (19 PRs merged 2026-03-24). The following files will conflict:

  • 02-detection.sh — we added cpu/apple case handlers (#596, "fix(installer): CPU backend wrongly overridden to nvidia when capability profile loaded") and the tier assignment block was modified
  • 06-directories.sh — heavily modified for Lemonade .env generation, LiteLLM config, DREAM_MODE
  • constants.sh — VERSION bumped to 2.4.0
  • .env.schema.json — several new keys added (DREAM_MODE, LLM_BACKEND, LLM_API_BASE_PATH, TARGET_API_KEY, OPENAI_API_KEY)
  • resolve-compose-stack.sh — lemonade mode added to compose stack resolution

Action needed: Rebase onto current main and resolve conflicts. We can help with this if needed.

Revision Requests

  1. Verify INTERACTIVE guard on manual GPU assignment. Phase 03 adds interactive prompts for custom GPU assignment. Confirm these are gated by [[ "$INTERACTIVE" == "true" ]] so non-interactive installs (CI, scripted, --yes) don't hang waiting for input.

  2. jq is now a hard dependency. Phase 01 auto-installs jq if missing — this changes jq from optional to required. Acceptable given multi-GPU needs it, but:

    • Add jq to the README prerequisites list
    • Consider gating the auto-install behind GPU_COUNT > 1 so single-GPU users aren't surprised
  3. LLM_MODEL_SIZE_MB — missing newline at EOF in .env.schema.json. The diff shows the trailing } lost its newline. Minor but fails some linters.

  4. compose.multigpu.yml — NVIDIA_VISIBLE_DEVICES default. Currently defaults to all when LLAMA_SERVER_GPU_UUIDS is empty. On a multi-GPU system where assignment runs, this should always be set. But if the assignment fails silently, all is a safe fallback. Consider logging a warning when falling back.

  5. Temp file cleanup. TOPOLOGY_FILE is created via mktemp in Phase 03 but never cleaned up. Add a trap or explicit rm at the end of the phase.

Testing Requirements

  • Needs testing on actual multi-GPU hardware (we don't have any in the current test matrix)
  • The 7 topology fixtures cover detection well, but end-to-end install → compose up → services running on assigned GPUs hasn't been validated
  • Specifically: verify NVIDIA_VISIBLE_DEVICES with UUID list actually constrains the right containers

What's Good

  • nvidia-topo.sh is a clean pure-function library — follows the lib/ conventions perfectly
  • assign_gpus.py strategy engine is well-separated from bash
  • Compose overlay pattern (compose.multigpu.yaml per service) is consistent with existing GPU overlays
  • Test fixtures are thorough (1-GPU PCIe through 8-GPU NVLink full mesh with NUMA)
  • LLM_MODEL_SIZE_MB in tier-map is a useful addition beyond multi-GPU
  • NVLink vs PCIe strategy selection (tensor split vs pipeline) is the right approach

Status: Defer merge until conflicts resolved and multi-GPU hardware testing available. Happy to help with conflict resolution when ready.

@y-coffee-dev
Contributor Author

Hey @Lightheartdevs! Thanks for the thorough review across installer files, really appreciate the care that went into this.

Good catch on the potential silent drop, I rebased and made sure all LLM_MODEL_SIZE_MB values survived.

Revision requests

INTERACTIVE guard: confirmed and added an explicit early-return guard at the top of run_custom() as an extra safety net, on top of the existing call-site gating.

jq: After digging deeper, it turns out jq was already a hard dependency before this PR: scripts/validate-env.sh has been using it to parse .env.schema.json on every install, regardless of GPU count, so single-GPU users were already getting it. Added it to the README prerequisites.

.env.schema.json newline + LLM_MODEL_SIZE_MB: fixed the trailing newline and added LLM_MODEL_SIZE_MB as a proper schema entry.

NVIDIA_VISIBLE_DEVICES fallback warning: added a warn in phase 03 right after GPU assignment is extracted.

Temp file cleanup: added trap "rm -f $TOPOLOGY_FILE" EXIT and added an explicit rm -f "$TOPOLOGY_FILE" at the end of the phase so it cleans up promptly.

Hardware testing

I tested this end-to-end on several multi-GPU machines before, install through compose up, services running on assigned GPUs, and confirmed NVIDIA_VISIBLE_DEVICES with UUID lists actually constrains the right containers. That said, totally understand if you'd rather wait until the CI matrix has multi-GPU coverage before merging.

Thanks again for the detailed feedback!

@Lightheartdevs
Collaborator

Code Review — 2026-03-25

Solid multi-GPU feature. Architecture is clean, test coverage is thorough, single-GPU and AMD paths are unaffected. Ready to merge with minor notes:

Non-blocking observations

  1. jq dependency — Phase 02 now requires jq for topology JSON parsing. README mentions it but Phase 01 preflight doesn't check for it. The installer already auto-installs jq per the README note, so this should be fine in practice.

  2. base64 -w0 is GNU-specific (line 293, Phase 06). macOS base64 doesn't support -w0. Not a practical issue since multi-GPU is gated by GPU_BACKEND == nvidia and GPU_COUNT > 1 — no Mac will hit this path. But if anyone adds AMD multi-GPU later, this needs a portable wrapper.

  3. Custom mode duplicates assign_gpus.py parallelism logic in bash (Phase 03, ~line 240). Comment says "keep in sync." Consider calling assign_gpus.py with a custom topology JSON from the custom assignments instead, to avoid divergence.

  4. LLM_MODEL_SIZE_MB — verify this is always set before Phase 03 runs. If the tier map doesn't populate it, the assignment algorithm will error.

What's good

  • Topology fixture coverage is excellent (8 configs from 1-GPU to 8-GPU NVLink mesh)
  • Graceful degradation on detection failure (warn, skip, continue)
  • Compose overlay pattern is consistent with existing AMD/CPU/Apple overlays
  • Interactive custom mode gives power users full control
  • Single-GPU installs see zero change

Approved. ✅

@Lightheartdevs Lightheartdevs merged commit d92d789 into Light-Heart-Labs:main Mar 25, 2026
12 of 22 checks passed